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Abstract 

Background: Oenococcus oeni, a member of the lactic acid bacteria, is one of a limited number of microorganisms 
that not only survive, but actively proliferate in wine. It is also unusual as, unlike the majority of bacteria present in 
wine, it is beneficial to wine quality rather than causing spoilage. These benefits are realised primarily through 
catalysing malolactic fermentation, but also through imparting other positive sensory properties. However, many of 
these industrially-important secondary attributes have been shown to be strain-dependent and their genetic basis it 
yet to be determined. 

Results: In order to investigate the scale and scope of genetic variation in 0. oeni, we have performed whole- 
genome sequencing on eleven strains of this bacterium, bringing the total number of strains for which genome 
sequences are available to fourteen. While any single strain of 0. oeni was shown to contain around 1800 protein- 
coding genes, in-depth comparative annotation based on genomic synteny and protein orthology identified over 
2800 orthologous open reading frames that comprise the pan genome of this species, and less than 1200 genes 
that make up the conserved genomic core present in all of the strains. The expansion of the pan genome relative 
to the coding potential of individual strains was shown to be due to the varied presence and location of multiple 
distinct bacteriophage sequences and also in various metabolic functions with potential impacts on the industrial 
performance of this species, including cell wall exopolysaccharide biosynthesis, sugar transport and utilisation and 
amino acid biosynthesis. 

Conclusions: By providing a large cohort of sequenced strains, this study provides a broad insight into the genetic 
variation present within O. oeni. This data is vital to understanding and harnessing the phenotypic variation present 
in this economically-important species. 
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Background 

Like many fermented foods and beverages, wine repre- 
sents a historical method of nutrient preservation that 
relies on suppressing the growth of spoilage microorgan- 
isms in order to provide long-term storage. In finished 
wine, it is the physiochemical combination of high levels 
of ethanol and sulfur dioxide, scarcity of "preferred" nu- 
trient sources and low pH that combine to produce a 
harsh environment in which all but a small number of 
microorganisms can proliferate. 
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Oenococcus oeni, a member of the lactic acid bacteria 
(LAB), is one of the limited number of bacterial species 
that not only survive, but actively grow in wine. In fact, 
O. oeni is present at extremely-low to undetectable levels 
on intact grapes or in the general environment, with 
wine seemingly representing the exclusive niche of this 
bacterium, despite the seasonal nature of wine produc- 
tion [1-3]. It is therefore fortunate that O. oeni provides 
positive attributes with regard to wine quality, as the 
majority of other wine-associated bacterial species are 
linked with spoilage [4]. The primary recognised role of 
O. oeni in winemaking is in performing malolactic fer- 
mentation (MLF), a de-acidification reaction in which 
malic acid is decarboxylated to lactic acid [5]. However, 
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in addition to performing MLF, there is evidence that 
the growth of O. oeni in wine impacts on flavor, aroma 
and mouth-feel in a strain-specific manner [6-8]. 

Recently, the genome sequences of three O. oeni 
strains, PSU-1 [9,10], BAA-1163 and AWRIB429 [11] 
were released. Genomic comparisons indicated that, for 
a free-living microorganism, O. oeni has a compact gen- 
ome of approximately 1.8 Mb, which presumably reflects 
genomic streamlining that has occurred during its adap- 
tation to the ecologically-restricted niche of fermenting 
grape juice and wine [10,12,13]. However, despite its 
already streamlined genome the three strains of O. oeni 
were shown to display significant inter-strain genomic 
variation with the potential for up to 10% variation in 
protein coding genes [11]. 

To define the scale and scope of the pan and core gen- 
omic potential of this important industrial species, we 
have sequenced a further eleven strains of O. oeni from 
both commercial and environmental sources, bringing 
the total number of available genomes for this species to 
fourteen. Comparison of this expanded group of isolates 
has identified additional variation across the species, 
defined clear multi-strain groups with conserved sets of 
strain-specific genes and genomic deletions, and has 
provided genetic bases for phenotypic characteristics 
that separate specific strains. 

Results and discussion 

The genome sequences for eleven strains of O. oeni were 
each assembled from 1 x 10 6 Alumina sequencing reads 
(100 bp, paired-end library) using MIR A. Of the eleven 
strains, six were chosen based upon diverse microarray- 
based comparative genome hybridization profiles [11], 
while the remainder were selected as Australian environ- 
mental isolates acquired over five decades in addition to 
the official type strain of O. oeni (DSM2052; 
AWRIB129). Manual curation of each assembly resulted 
in the genomic sequence of each strain being captured 
in a small number of single-copy contigs separated by 
low-copy number repeats whose length was greater than 
that used for the paired-end library construction 
(-500 bp) (Table 1). Using conserved genomic synteny, 
these contigs were then rationally assembled into 
chromosomal super-contigs through positioning these 
repeat sequences in each of their multiple genomic loca- 
tions (See Additional file 1). This resulted in near 
complete chromosomal assemblies for each strain (aver- 
age 2.36 assembly gaps per strain), providing a solid 
foundation for inter-strain comparison (Table 1). 

In addition to chromosomal-associated nucleotide 
sequences, there were small circular replicons that were 
identified in at least four strains (AWRIB419, 
AWRIB422, AWRIB568 and AWRIB578). While this 
plasmid was of similar size across the four strains and 



Table 1 Assembly statistics 



Strain 


Total 
contigs 


Repeat 
contigs 


Chromosomal 
super contigs 


Plasmids 


Unplaced 
contigs 
(scaffolds) 


AWRIB129 


42 


13 


1 


0 


0 


AWRIB202 


36 
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0 
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AWRIB304 


38 
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AWRIB418 


34 


10 


4 


0 


0 


AWRIB419 


46 


13 


4 


1 


0 


AWRIB422 


32 


7 


6 


1 


0 


AWRIB548 


29 


8 


2 


0 


0 


AWRIB553 


32 


9 


1 


0 


0 


AWRIB568 


31 


7 


6 


1 


0 


AWRIB576 


28 


7 


3 


1 


0 



often encoded many proteins with similar predicted 
functions (Table 1), there appears to be at least three 
distinct types. In each case, the plasmid appears to en- 
code only those functions required for replication and 
maintenance, with the exception of a putative glycerate 
dehydrogenase (2-hydroxyacid dehydrogenase) found in 
plasmids from AWRIB422, AWRIB568 and AWRIB576 
and a putative NADH:flavin oxidoreductase found in 
plasmids from all four strains. At this point, the selective 
advantage (if any) provided by these plasmids and these 
enzymatic functions remains to be determined. 

Pan and core genomes of O. oeni 

Previous comparative genomic research was able to 
show that there was significant genomic variability be- 
tween three O. oeni isolates for which genome sequence 
information was available prior to this study [9-11,14]. 
In order to examine the extent of protein-coding vari- 
ability across this expanded set of fourteen strains, open 
reading frames (ORFs) were predicted from each gen- 
omic dataset (See Methods) and this information was 
used to produce detailed genomic orthology compari- 
sons that were based upon sequence homology com- 
bined with genomic synteny (See Additional file 2). As a 
basis for orthology mapping, the original genomic anno- 
tation of O. oeni strain PSU-1 was used as a reference 
annotation as this was both the first O. oeni strain to be 
sequenced and is also the only strain for which a 
complete, finished genome sequence is available [9,10]. 

On average, each strain was predicted to contain 
1800 ± 52 full length ORFs and 104 ± 27 potential pseu- 
dogenes associated with the chromosomal replicon (Fig- 
ure 1), with some strains also containing ORFs (n<50) 
associated with the plasmid replicon. However, despite 
the relatively tight distribution in the number of protein- 
coding genes in each strain, there is considerable 
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Figure 1 Core and pan genomes of O. oeni. A. Proportions of core (conserved) ORFs, strain-specific ORFs and putative pseudogenes in 
fourteen strains of O. oeni. The overall number of each type of ORF is also indicated based upon syntenic assignment of orthologs across all 
fourteen strains. B. The estimated average size of the O. oeni core (pink solid line) and pan (light blue solid line) genomes as a function of the 
number of individual strains compared. For each point, the size of core and pan genome was calculated for all combinations of x strains from the 
fourteen strains analysed with the results presented as filled circles. 



variation in the subset of ortholgous genes present in 
each strain. In order to quantify this variation in coding 
potential, the extent of the core and pan genomes of this 
collection of O. oeni strains were calculated. There were 
2846 non-degenerate ORFs that were shown to comprise 
the chromosomal pan genome of this group of O. oeni 
strains, with 1165 of these representing core ORFs con- 
served across all fourteen strains (Figure 1). As observed 
for other bacterial species, the size of the conserved core 
of protein from O. oeni decreases as a function of the 
number of strains compared, while size of the pan gen- 
ome increases [15-17] (Figure IB). Also, given that the 
rate of expansion of the pan genome showed no signs of 
significant decrease as additional numbers of strains 
were added to the analysis, it appears that the genetic di- 
versity present within this strain has not yet been ex- 
haustively recorded. 

In total, there were 1064 non-degenerate ORFs that 
were predicted to encode full-length functional proteins 
(based on homology searches) in at least one of the 
strains (in addition to 93 pseudogenes) that are absent in 



the original PSU-1 gene annotation [9,10] (Additional 
file 3). Of these, 64 are due to annotation differences 
and are found in the PSU-1 genome via the annotation 
pipeline applied in this study compared to that used in 
the original annotation of the PSU-1 genome. For an 
additional 58 of these, pseudogenes exist in PSU-1 
whereas full-length proteins are present in at least one 
other strain. The remaining 942 ORFs are the result of 
strain-specific insertion events in strains other than 
PSU-1. In addition, over one third (348) of these non- 
PSU-1 proteins display their closest homology to pro- 
teins from outside of O. oeni (including protein-coding 
genes from BAA-1163 and AWRIB429) and represent 
new additions to the O. oeni pan genome. 

Horizontal gene transfer 

In order to determine if any of the strain-specific genes 
in the O. oeni genome were the result of horizontal gene 
transfer (HGT), the genome of each strain was interro- 
gated for regions with an increased probability of being 
horizontally-acquired [18]. While there were numerous 
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regions that exceeded the threshold for being potentially 
horizontally acquired (Additional file 4), one region, 
present in at least seven of the strains, had a very high 
probability of resulting from HGT , (Figure 2A). This re- 
gion was subsequently shown to contain evidence of two 
independent HGT events (separated by -65 kb) involv- 
ing IS element insertions that appear to have been hori- 
zontally transferred from Lactobacillus spp. The first of 
these (pan_genome loci 1725-1734) appears to be 



associated with an IS30 element insertion that was first 
discovered in AWRIB429 [11,19]. The second region 
(pan_genome loci 1802-1816) is associated with an in- 
sertion event within an ORF that encodes an arginine 
deaminase (OEOE_1118 of PSU-1). The last 3.5 kb of 
this 7.5 kb fragment has -99% identity to a large portion 
of a genetic element that has been characterised in beer- 
spoilage strains of Lactobacillus spp. and Pedicoccus 
spp. [20]. This element has been postulated to be 
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Figure 2 Horizontal gene transfer in O. oeni. A. The probability of genomic regions being present due to horizontal gene transfer (HGT). A 
region of high probability of strain-specific HGT is indicated (black arrow). B. Maximum-likely hood phylogeny of a horizontally-acquired 
transmembrane protein within the region highlighted in (A). Bacterial genera are indicated by colored shading {Oenococcus spp., blue; 
Lactobacillus spp., pink; Weissella spp., green; Pediococcus spp., yellow). C. Homology between 0. oeni and Lactococcus spp. A representative 
protein for each pan_genome locus was used in homology searches against the combined predicted protein sequences from each Lactococcus 
spp. genome available in Genbank. Average identity values were calculated using a 10 ORF sliding window (blue line). Regions predicted to be 
result of HGT in which >5 adjacent ORFs displayed >90% amino acid identity between an 0. oeni protein and a protein from Lactococcus spp. are 
also indicated (red boxes). 
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horizontally-transferred and, due to the presence of the 
HorC efflux pump, to provide resistance to antibacterial 
compounds present in hops [21]. However, the portion 
of the element present in strains of O. oeni lacks the 
horC gene, while still encoding the glycosyltransferase, 
an integral membrane protein, and a cell wall teichoic 
acid glycosylation protein that are present in the 3' half 
of the element. As expected for a HGT event, these pro- 
teins share a higher than expected degree of relatedness 
compared to the evolutionary distance that separates 
Oenococcus spp. and Lactobacillus spp. (Figure 2B, Add- 
itional file 2). 

There were at least another five genomic insertions in 
various subsets of the strains (including regions within 



the extra-chromosomal replicons) that appear to be the 
result of HGT. Based on homology searches, the HGT 
events giving rise to these elements, like those men- 
tioned previously, appear to have originated from Lacto- 
bacillus spp. (Figure 2C, Additional file 5). It therefore 
appears that Lactobacilli provide a potential reservoir of 
genes for O. oeni. 

Highly variable bacteriophage integration across strains 

One of the most striking variations in intra-specific cod- 
ing potential across the O. oeni strains was in the num- 
ber and position of temperate bacteriophage integrations 
(Figure 3). O. oeni has been shown previously to harbour 
at least four separate bacteriophages that integrate 
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Figure 3 Bacteriophage diversity. A. Sites of bacteriophage integration throughout the 0. oeni genome are indicated by along with the name 
of the specific tRNA that represents the bacteriophage attachment site. Other tRNA genes predicted to be present in the 0. oeni genome are also 
indicated (half-height black lines). Full bacteriophage elements (pink boxes) are characterised by the presence of both integrase (//if) and lytic (lys) 
enzymes while bacteriophage fragments (green boxes) generally only contain int ORFs. All elements are drawn to scale. B. Maximum-likelyhood 
phylogenetic analysis of bacteriophage int proteins from 0. oeni. Colored shading is used to group proteins based on their site of integration and 
to define their origin as either being from a full bacteriophage element (pink) or a bacteriophage fragment (green). Previously identified phage 
protein sequences are indicated in bold. C. Maximum-likelyhood phylogenetic analysis of bacteriophage lys proteins from 0. oeni. Shading is 
identical to that in part B. 
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through tRNA-associated attachment sites (fOgPSUl, 
fOg44, fOg30 and O10MC) [22-25]. In this study, a total 
of six different tRNAs were shown to potentially be 
involved in the integration of bacteriophage, with four 
shown to be current sites of insertion of full-length and 
presumably functional bacteriophage (Figure 3A). The 
remaining two tRNAs (plus a third in AWRIB129) con- 
tained bacteriophage remnants and may represent sites 
at which integration and then subsequent excision of a 
bacteriophage has occurred. It was also apparent that for 
two of the insertion events (at OEOE_t0506 in 
AWRIB548 and OEOE_t685 in AWRIB304), the entire 
bacteriophage sequence had been tandemly duplicated at 
the integration site. 

In order to determine phylogenetic relationships be- 
tween bacteriophage sequences, genes encoding the 
highly conserved bacteriophage integrase (int) and endo- 
lysin (lyt) proteins (if present) were used to construct 
phylogenies for each integration event. For the integrase 
proteins (Figure 3B), it was possible to show that most 
proteins grouped according to the particular tRNA at 
which the integration occurred, with the exception of 
two int ORFs from the bacteriophage fragments in 
AWRIB129; the latter was significantly different from 
the other int ORFs present at the same tRNA site. This 
linking of tRNA and int sequence also extended to the 
known sequences of fOgPSUl, fOg44 and O10MC that 
were found to group according to their previously iden- 
tified tRNA sites [23,25]. However, this linking between 
site of tRNA integration and int sequence did not hold 
for the lys genes (Figure 3C), where there was consider- 
able variation in the sequence of the endolysins encoded 
by the ORFs found at each tRNA sequence between the 
previously known bacteriophage elements and those 
found in this study. 

Phylogenetic diversity 

Independent of coding-region predictions, it was pos- 
sible to determine the phylogenetic relationship of the 
various strains from the patterns of single-nucleotide 
polymorphisms deduced from whole-genome nucleotide 
alignments (Figure 4A). The phylogeny produced from 
this alignment led to two major findings. First there was 
a large evolutionary distance between a basal clade 
formed by AWRIB418 and BAA-1163 and the other 
twelve strains of O. oeni, suggesting that these two 
strains form a distinct evolutionary group, a finding that 
is supported by the results of previous MLST typing of 
typing [26]. In order to investigate this apparent division, 
a second phylogeny was constructed using the predicted 
sequences of the core, conserved proteins present in the 
fourteen O. oeni strains, in addition to orthologous 
sequences from O. kitaharae DSM 17330 as an outgroup 
[27] (Figure 4C). This phylogeny is consistent with 



BAA-1163 and AWRIB418 comprising a basal, divergent 
clade, however the genetic distance between these 
groups of strains is far less than observed between any 
strain of O. oeni and O. kitaharae. As such, BAA-1163 
and AWRIB418 may together represent a divergent sub- 
species of O. oeni, a fact that is supported by the pres- 
ence of a large number of ORFs that are found only in 
these two strains (See below). 

The second finding from the O. oeni strain phylogeny, 
is the very close genetic relationship between the strains 
AWRIB429, AWRIB576, AWRIB568, AWRIB304, 
AWRIB318 and AWRIB202. Interestingly, all of these 
strains, with the exception of AWRIB429 (Lalvin VP41) 
which was isolated in Italy, was isolated from natural 
wine ferments in Australia over a fifty year period 
(Table 2) (Figure 4A). Given the unexpected nature of 
the phylogenetic relationship between the Australian iso- 
lates and AWRIB429, the identity of AWRIB429 as an 
isolate of VP41 was confirmed by sequencing one of the 
cell wall loci (locus 2) from a second, independent iso- 
late of VP41 which was obtained from the supplier of 
this strain (AWRIB551). This second isolate has a nu- 
cleotide sequence that was 100% identical to that of 
AWRIB429, indicating that these two strains most likely 
both represent Lalvin VP41 and that the observed phyl- 
ogeny is in fact a true representation. However the cause 
of this relationship between VP41 and the Australian 
isolates remains to be determined, but it is interesting to 
note that the isolation of many of the Australian strains 
predates the use of Lalvin VP41 as a commercial malo- 
lactic starter strain, therefore environmental introduc- 
tion of this strain through winery inoculation can be all 
but excluded as a hypothesis. 

Cell wall exopolysaccharide variation 

Using whole-genome sequence data, it was shown previ- 
ously that significant differences existed in the compos- 
ition of the exopolysaccharide (EPS) operons of O.oeni 
PSU-1, BAA-1163 and AWRIB429 [11,31]. In the 
current expanded study, it was possible to define three 
separate clusters of ORFs that are potentially associated 
with EPS production and that show substantial variation 
across the fourteen strains (Figure 4B; Additional file 6). 
At each of the three loci, there was a variable number of 
genomic "cassettes" (locus 1 has 7 different genetic cas- 
settes, locus 2, 3 cassettes and locus 3, three cassettes) 
providing the basis for a high degree of potential intras- 
pecific diversity in the composition of O. oeni cell wall. 

The combination of independent cell wall loci in each 
strain was consistent with the whole-genome phyloge- 
nies and defined three main groups of strains (Figure 4B; 
Additional file 6). It is predicted that these groups 
therefore comprise three distinct cell wall types that 
approximate serotype variation that is observed in 
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some species of pathogenic bacteria [32]. Also, as bac- 
terial EPS can potentially interact with, and therefore 
modulate components of wine, there is the potential 
for these different EPS genotypes to differentially 
affect wine quality. This is consistent with reported 
differences in both the amount and type of EPS pro- 
duced by at least three of the strains in this study that 
contain different EPS operon combinations (PSU-1 
BAA-1163 and AWRIB429) [31]. 



Sugar transport and utilisation 

The range of sugars that O. oeni is capable of utilising is 
strain dependent [33]. To investigate the genetic basis of 
differences in sugar metabolism, we began by classifying 
genes that are predicted to encode phosphotransferase 
system (PTS) enzyme II sugar transporters in the O. oeni 
pan genome. In total, there were 46 PTS subunits that 
grouped into 18 separate genomic loci (individual subu- 
nits being adjacent in the genome) (Figure 5). Fourteen 
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Table 2 Strains used in this study 



Strain 


Other isolate name(s) 


Commercial supplier 


Place of origin 


Reference^ 


Genbank accession number 


AWRIB129 


Type strain; DSM 20252 




France 


[28]; This study 


AJTP00000000 


AWRIB697 


BAA-1 163; IOB8413 








AAUV00000000 


AWRIB563 


PSU-1 




USA (1972) 


[29] 


NC_008528 


AWRIB429 


Lalvin VP41 


Lallemand 


Abruzzi region, Italy 


[1 1]; This study 


ACSE00000000 


AWRIB418 


MCW 


Lallemand 


Sonoma Country, USA 


This study 


ALAEOOOOOOOO 


AWRIB419 


Lalvin EQ54 


Lallemand 


Cote de Rhone, France 


This study 


ALAFOOOOOOOO 


AVVKI DJjJ 


tnos- 1 


Lallemand 


Tuchan Languedoc, France 


This study 


ALAIUUUUUUUU 


AWRIB548 


MT01; BL01 


Lallemand 


France 


This study 


ALAHOOOOOOOO 


AWRIB422 


Viniflora CH35 


Chr Hansen 




This study 


ALAGOOOOOOOO 


AWRIB202 






Coonawarra, Australia (1994) 


[11]; This study 


AJTOOOOOOOOO 


AWRIB304 






Adelaide Hills, Australia (1999) 


[11]; This study 


AJIJOOOOOOOO 


AWRIB318 


NCDO 1884 




Merbein, Australia (1956/57) 


[30]; This study 


ALADOOOOOOOO 


AWRIB568 






Mt Mary, Australia (2008) 


This study 


ALAJOOOOOOOO 


AWRIB576 






Mt Mary, Australia (2008) 


This study 


ALAKOOOOOOOO 



of these loci contained either individual or multi-domain 
ORFs encoding IIA, IIB and IIC functions (and in two 
cases IID) in at least one strain and would therefore be 
expected to encode fully functional transporter com- 
plexes (Figure 5). Correlating with differences in carbo- 
hydrate utilisation, only three of these complexes were 
conserved across all fourteen strains (in addition to three 
others in which strains contained potential pseudogenes 
in otherwise complete complexes). 

Interestingly, for one specific PTS cluster (present in 
AWRIB548 and AWRIB422), there appears to have been 
a strain-dependent genomic insertion event that resulted 
in the incorporation of two different IIA, IIB duplexes 
with a conserved IIC subunit (pan_genome loci 2350 to 
2358) (Figure 5B). Accompanying this exchange of IIB 
and IIC subunits, this genomic insertion event has 
resulted in the formation of a fused ORF encoding the 
conserved PTS IIC subunit at the 3' end and a sequence 
that encodes a protein with high homology to glycosi- 
dases at the 5' end. There have been many instances of 
PTS subunit protein fusions reported previously [34]. 
However, the PTS HC-glycosidase fusion found in 
AWRIB548 and AWRIB422 appears to be novel, as 
homology searches provide significant matches to both 
the amino- and carboxyl- sections of the fusion protein 
in isolation, but lack single protein matches across the 
entire length of the predicted fusion protein (Additional 
file 7). While the function of this protein remains to be 
determined, it is tempting to speculate that the glycosi- 
dase domain of the fusion can catalyse the release of 
sugars that can be subsequently transported into the cell 
via the PTS system. 

In addition to intra-specific variation in the PTS trans- 
porter systems of O. oeni, there were also differences in 



metabolic pathways for sugar utilisation (Figure 6). Ara- 
binose and xylose are two sugars that have been noted 
as displaying strain-dependent utilisation profiles in O. 
oeni [33]. While genes imparting the ability to utilise xy- 
lose were not evident in the sequenced strains, the data 
supports strain-dependent metabolic potential to utilise 
L-arabinose (including the arabinose polymer arabinan) 
and L-xylulose (Figure 6A). 

The potential to utilise L-xylulose is predicted to be 
dependent on an insertion event found in nine of the 
fourteen strains, which introduces the three enzymes 
predicted to be required for conversion of L-xylulose to 
D-xylulose-5-phosphate (for entry in the pentose phos- 
phate pathway) (Figure 6B). L-arabinose utilisation is 
also encoded by a set of three enzymes. However, unlike 
the L-xylulose cluster, the genomic locus containing the 
arabinose metabolism associated ORFs was found to be 
present in all fourteen strains, with the intraspecific util- 
isation of L-arabinose being predicted to be due to the 
presence of nonsense mutations in one or more of the 
three ORFs in some strains (Figure 6C). In addition to 
the ability to utilise L-arabinose, BAA- 1163 and 
AWRIB418 are predicted to be capable of degrading the 
arabinose polymer arabinan (Figure 6A). Interestingly, 
this ability in AWRIB418 is predicted to be due an inser- 
tion event specific to this strain and BAA- 11 63, however 
the ORF in BAA- 11 63 ORF is interrupted by a nonsense 
mutation. In contrast, BAA- 11 63 is predicted to utilise 
arabinan through the function of a conserved genomic 
locus that is predicted to be a pseudogene in all but this 
strain (Figure 6C). 

In addition to their ability to utilise arabinan, 
AWRIB418 and BAA-1163 also encode a functional a- 
glucosidase (pan genome ORF 1409; pseudogene in the 
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Figure 6 Variations in sugar utilisation pathways in O. oeni. A. The predicted pathway for the assimilation of L-arabinose and L-xylulose in 0. 
oeni. EC numbers are provided for each enzymatic step. 0. oeni strains displaying metabolic blocks at individual enzymatic steps are also 
indicated. Arrows are color-coded according to the location of the ORF in parts B and C. B. A strain-specific genomic insertion predicted to 
impart the ability to utilise L-xylulose. Individual ORFs surrounding this locus are shown for each strain. Orthologous ORFs are positioned vertically 
and are color coded according to their conservation status (conserved, pink; strain-specific, light-blue). ORFs predicted to be directly involved in 
the utilisation of L-xylulose are color-coded according to their metabolic role (green, L-xylulose utilisation, orange, L-xylulose and L-arabinose 
utlisation). C. A strain-specific genomic insertion predicted to impart the ability to utilise arabinan and L-arabinose. Shading is identical to part B 
except for the presence of ORFs involved directly in the utilisation of arabinan and L-arabinose (shaded yellow). 
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other strains), which is predicted to allow these strains 
to utilise sucrose (a rare-trait in O. oeni [33]), while the 
BAA- 11 63 genome is also predicted to encode a L-iditol 
2-dehydrogenase (pan genome ORF 2413; strain-specific 
insertion) that would allow for the utilisation of D- 
sorbitol through its conversion to D-fructose. Interest- 
ingly, while the utilisation of fructose by O. oeni is a 
ubiquitous property of these strains [33], the metabolic 
pathway for the catabolism of this sugar is lacking in 
the PSU-1 genome due to a frameshift in the coding re- 
gion of the gene encoding fructose-bisphosphate aldo- 
lase (4.2.1.14). The full-length coding region is present 
in all other sequenced strains of O. oeni and this infor- 
mation, combined with the phenotypic data available for 
O. oeni, would suggest that this difference likely repre- 
sents a sequencing error in the current PSU-1 
sequence. 

Amino acid biosynthesis 

O. oeni has a variety of amino acid auxo trophies with 
several showing intraspecific differences [28]. Initial ana- 
lysis of the genome of O. oeni PSU1 suggested the pres- 
ence of biosynthetic pathways for up to eight amino 
acids [10]. In the expanded dataset provided by the four- 
teen genome sequences generated in this study, it is ap- 
parent that there are instances of intra- specific diversity 
in the ability to synthesise specific amino acids (Table 3). 
One of these main differences is in the predicted ability 
of the divergent O. oeni strains BAA-1163 and 
AWRIB418 to potentially synthesise the amino acid leu- 
cine. While most strains of O. oeni lack the enzyme 3- 
isopropylmalate dehydrogenase (EC: 1.1. 1.85) due to the 
presence of a conserved non-sense mutation in the ORF, 
both BAA-1163 and AWRIB418 are predicted to encode 
a full-length functional enzyme (Additional file 1, pan 
genome ORF 2716). In addition to this potential differ- 
ence in leucine auxotrophy, there are predicted differ- 
ences in the ability to synthesize threonine, glutamine 
and methionine due to the presence of nonsense muta- 
tions in genes involved in these pathways in specific 
strains (Table 3). The findings for these amino acids are 
consistent with previous phenotypic tests that found 
variable responses for these amino acids in this species 
[28]. 

Conclusions 

Like other industrial species, of microorganisms pheno- 
typic variation in O. oeni will have direct economic conse- 
quences through impacts on product quality and 
production efficiencies. A thorough understanding of the 
basis of this variation therefore provides the means to im- 
prove the industrial performance of these strains or to 
easily screen for new strains with multiple, desirable 
traits. This study provides a solid foundation for the 
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Table 3 Strain-dependent differences in amino acid genes 



Amino acid 


Enzyme 


Pan_genome 
number 


Strain(s) 
affected 


L-threonine 


homoserine kinase 


2510 


AWRIB304 




2.7.1.39 




PSU-1 


L-glutamine 


L-glutamate synthetase 
6.3.1.2 


1612 


AWRIB202 


L-methionine 


Aspartate kinase 


2507 


AWRIB304 




2.7.2.4 




AWRIB422 
AWRIB548 




O-succinylhomoserine 
(thiol)-lyase 


1350 


PSU-1 




2.5.1.48 




AWRIB129 



AWRIB202 



AWRIB304 
AWRIB419 
AWRIB422 
AWRIB429 
AWRIB553 
AWRIB568 
AWRIB576 

Methionine synthase II 1895 BAA-1163 
2.1.1.14 AWRIB202 

AWRIB304 
AWRIB318 
AWRIB422 
AWRIB429 
AWRIB548 
AWRIB568 
AWRIB576 

L-leucine 3-isopropylmalate 2716 PSU-1 

dehydrogenase 

1.1.1.85 AWRIB129 

AWRIB202 
AWRIB304 
AWRIB318 
AWRIB419 
AWRIB422 
AWRIB429 
AWRIB548 
AWRIB553 
AWRIB568 
AWRIB576 



investigation of phenotypic diversity in O. oeni by provid- 
ing whole-genome sequences for a large cohort of strains 
from both commercial and environmental sources. As 
such, we have identified significant variation across the 
strains that were investigated, including differences in cell 
wall synthesis and sugar utilisation, that were largely due 
to differential insertion of large, multi-genic nucleotide 
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fragments. These differences can be used to inform re- 
search on the industrial implications of this genetic vari- 
ation while allowing for the identification of strains with 
combinations of desirable genetic, and therefore pheno- 
typic, characteristics. 

Methods 

Strains and growth conditions 

Strains used in this study are listed in Table 2 and were 
selected to represent a cross-section of commonly used 
commercial strains, in addition to Australian environ- 
mental isolates present in the AWRI culture collection. 
To prepare genomic DNA, each strain was grown in 
MRS (Amyl Media, Australia) supplemented with 20% 
apple juice [35] for between six and ten days at 27°C. 
DNA was prepared by phenol chloroform extraction as 
previously described [36]. 

Genome sequencing and assembly 

Genome sequencing was performed at the Ramaciotti 
Centre for Gene Function Analysis (University of New 
South Wales, NSW, Australia) using Illumina sequen- 
cing technology and 2 x 100 bp paired-end sequencing 
reads. For each strain 1 x 10 6 individual reads were de 
novo assembled using MIR A v 3.2.1 and manually 
refined using Seqman Pro (DNAstar). These Whole 
Genome Shotgun projects, including ORF predictions, 
have been deposited at DDBJ/EMBL/GenBank with the 
appropriate accession number for each strain used in 
this study listed in Table 2. 

Genome annotation and comparison 

Potential coding regions were predicted using Glimmer 
[37]. Genomic orthology was assigned via reciprocal 
homology searches using BLAST [38] combined with 
long-range genomic synteny. Intra-specific whole gen- 
ome alignments were produced using FSA [39] and used 
to construct a maximum-likelyhood phylogeny using 
PhyML [40]. For comparison to O. kitaharae, conserved 
protein orthologs were identified via homology (minimum 
60% identity when compared to the homologous O. kita- 
harae protein) in all of the O. oeni genomes used in this 
study. Next, proteins which had potential paralogs (which 
could confound the phylogeny) were identified by assign- 
ing each protein to specific orthoMCL [41] clusters and 
then only retaining those groups of orthologs in which 
each protein was the only member of a particular 
orthoMCL group. Individual protein alignments were 
then performed on each set of homologous sequences 
using Muscle [42]. These individual alignments were then 
concatenated into a single large sequence for each strain 
which was used to construct a maximum-likelihood 
phylogenetic tree using PhyML [40]. Bacteriophage phylo- 
genies were produced by aligning individual predicted 



protein sequences using Muscle [42] and then producing 
maximum-likelyhood trees using PhyML [40]. Enzyme 
function annotations were made using the KAAS annota- 
tor via the KEGG website [43]. Regions of potential hori- 
zontal gene transfer were detected using alien hunter 
using HMM-derived change point detection [18]. 

Additional files 



Additional file 1 Super contig structure. Describes the position of 
each sequencing contig within the chromosomal super contigs in each 
strain. 

Additional file 2 Syntenic alignment and annotation of O. oeni 
open reading frames. Contains information regarding orthology 
relationships and gene function for open reading frames from each 0. 
oeni strain for which sequencing data is available. 

Additional file 3 Non-PSU-1 open reading frames. Contains full 
positional and orthology information for all of the open reading frames 
not found in the "reference" strain PSU-1. 

Additional file 4 Genome-wide results of the "alien hunter" 
algorithm for each strain. Each individual plot has been spaced 
according to the whole-genome alignments to ensure that genomic 
regions are concordant across strains. The area of highest probability of 
HGT is indicated (black arrow). 

Additional file 5 Horizontal gene transfer events between O. oeni 
and Lactococcus spp. A representative protein (query_id) for each 
pan_genome locus was used in homology searches against the 
combined predicted protein sequences from each Lactococcus spp. 
genome (seq_id) available in Genbank. Average identity values were 
calculated using a 10 ORF sliding window (average(n = 10)). Regions in 
which >5 adjacent ORFs displayed >90% amino acid identity between an 
0. oeni protein and a protein from Lactococcus spp. are also indicated 
(window n = 5; identity >= 90). 

Additional file 6 Exopolysaccharide gene clusters. Contains full 
positional and orthology information for all of the open reading frames 
predicted to encode members of the strain-dependent cell wall clusters. 

Additional file 7 A novel glycosidase-PTS gene fusion. BLAST 
homology information for the glycosidase-PTS gene fusion predicted to 
be encoded by pan_genome ORF 2351. 
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